
Record: 1.1194 BPB — v9 Batched Muon + Full GPTQ Random Calib + JEPA Research#1124

Open
NewyorkDev wants to merge 2 commits into openai:main from NewyorkDev:submission/v9-batched-muon-1.1194

Conversation

@NewyorkDev

Summary

  • val_bpb: 1.1194 (3-seed mean, std 0.0002) | 15.90 MB max artifact | 8xH100 SXM, 600s
  • Key innovation: Batched Newton-Schulz orthogonalization via torch.bmm — groups 66 weight matrices into 4 shape-matched batches, 5% optimizer speedup, ~400 extra training steps
  • Full GPTQ with random token calibration — compliant Hessian collection without training data access
  • 20+ hours of JEPA/STP research — 14 ablation tests proving auxiliary losses hurt at this scale, documented with full tables
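
The batching idea above can be sketched in a few lines: group same-shape gradient matrices into one tensor, then run every Newton-Schulz iteration as a single batched matmul instead of one matmul per matrix. This is a minimal NumPy sketch (np.matmul broadcasting plays the role of torch.bmm); the quintic coefficients are the standard Muon values and the shapes are illustrative assumptions, not the submission's actual code:

```python
import numpy as np

def batched_newton_schulz(G, steps=5):
    """Approximately orthogonalize a batch of same-shape matrices at once.
    np.matmul broadcasts over the leading batch dim, the NumPy analogue of
    the torch.bmm batching described above. Coefficients are the standard
    Muon quintic (an assumption; the PR's exact values may differ)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G, axis=(1, 2), keepdims=True) + 1e-7)
    flipped = X.shape[1] > X.shape[2]
    if flipped:                          # work on the short side
        X = X.transpose(0, 2, 1)
    for _ in range(steps):
        A = X @ X.transpose(0, 2, 1)     # one batched matmul for the whole group
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.transpose(0, 2, 1) if flipped else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 16, 32))     # 4 shape-matched gradients per batch
O = batched_newton_schulz(G)
```

After five steps the singular values of each matrix in the batch cluster near 1, which is all the Muon update needs; the speedup comes purely from launching 4 batched matmuls instead of 66 individual ones per term.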

Results (3 seeds)

Seed   Sliding BPB   Artifact
1337   1.1191        15.90 MB
42     1.1195        15.98 MB
7      1.1195        15.90 MB
Mean   1.1194        -

Research Contributions

  • JEPA negative result at 27M scale: 14 controlled ablations showing JEPA (Joint-Embedding Predictive Architecture) hurts training at 600s/27M params. Found and fixed gradient interference bug (67% penalty reduction), but still net negative.
  • STP negative result: Tested LeCun lab's Semantic Tube Prediction (arXiv 2602.22617) — zero-param JEPA variant. Also negative at this scale.
  • Label smoothing eval bug: Discovered and fixed a subtle bug where label smoothing was applied during eval via model.forward(), contaminating BPB measurements.
  • TTT on XSA-all: Confirmed the finding of PR #1019 (Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112, val_bpb 1.11473, 3-seed mean) that score-first TTT is ineffective when XSA covers all layers.
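
The label-smoothing eval bug above boils down to a one-line gate: the smoothing epsilon must be zeroed outside training, or every BPB measurement is shifted even though the model is unchanged. A minimal NumPy sketch of the failure mode and the fix (illustrative only; not the submission's loss code):

```python
import numpy as np

def token_ce(logits, targets, label_smoothing=0.0, training=True):
    """Mean cross-entropy in nats. The gate on `training` is the fix the PR
    describes: if smoothing leaks into eval, measured loss (and hence BPB)
    is contaminated."""
    eps = label_smoothing if training else 0.0   # the fix: no smoothing at eval
    z = logits - logits.max(axis=-1, keepdims=True)           # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    n = logits.shape[0]
    nll = -logp[np.arange(n), targets]            # standard CE term
    uniform = -logp.mean(axis=-1)                 # smoothing target term
    return float(((1 - eps) * nll + eps * uniform).mean())

rng = np.random.default_rng(42)
logits = rng.standard_normal((128, 1024))
targets = rng.integers(0, 1024, size=128)
logits[np.arange(128), targets] += 3.0   # give the toy "model" some signal

clean = token_ce(logits, targets, label_smoothing=0.1, training=False)
buggy = token_ce(logits, targets, label_smoothing=0.1, training=True)
```

With smoothing leaking into eval, `buggy` sits well above `clean` for the same model; converting nats to bits-per-byte inherits the same offset, which is why the bug silently inflated BPB.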

Compliance

  • 3 seeds, all <= 600s training
  • All artifacts <= 16,000,000 bytes
  • No training data access during quantization (random calibration tokens)
  • No TTT on validation data
  • No network calls during evaluation
  • Single file train_gpt.py
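
The quantization-compliance point works because GPTQ never needs labels or real text, only the second moment of each layer's inputs, so random token ids suffice to drive the Hessian collection. A hypothetical NumPy sketch of that collection step, under assumed sizes and a stand-in embedding (the 2/N scaling and diagonal dampening follow standard GPTQ practice; none of this is the submission's actual code):

```python
import numpy as np

def gptq_hessian(X, damp=0.01):
    """H = 2/N * X^T X plus diagonal dampening: the per-layer Hessian proxy
    GPTQ needs. X holds the layer's calibration inputs, here produced from
    *random* token ids rather than training data (the compliance point)."""
    X2 = X.reshape(-1, X.shape[-1])
    H = 2.0 * (X2.T @ X2) / X2.shape[0]
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # keep H invertible
    return H

rng = np.random.default_rng(1337)
vocab, d_model = 1024, 64                              # illustrative sizes
embed = 0.02 * rng.standard_normal((vocab, d_model))   # stand-in embedding table
tokens = rng.integers(0, vocab, size=(8, 256))         # random calibration tokens
H = gptq_hessian(embed[tokens])                        # (d_model, d_model)
```

After dampening, H is symmetric positive definite, so its Cholesky factor can drive GPTQ's column-by-column quantization updates without ever touching the training set.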

Run Command

DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
GPTQ_ENABLED=1 STP_ENABLED=0 TTT_ENABLED=0 LABEL_SMOOTHING=0.0 \
XSA_LAST_N=11 EVAL_STRIDE=64 SEED=1337 \
torchrun --nproc_per_node=8 train_gpt.py

Full research journey, ablation tables, and architecture details in the README.

Generated with Claude Code

NewyorkDev and others added 2 commits March 30, 2026 03:19
3-seed mean 1.1194 BPB (std 0.0002) on 8xH100 SXM.
Key innovation: batched Newton-Schulz via torch.bmm (5% optimizer speedup).
Full GPTQ with random token calibration (compliant, no training data access).
Extensive JEPA/STP research documented — 14 ablation tests proving auxiliary losses
hurt at this scale. LZMA compression, XSA-all-11, FA3, LeakyReLU(0.5)^2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>